Bug 561546

Summary:

aisexec spins when semaphores run out

Product:

Red Hat Enterprise Linux 5

Reporter:

David Teigland <teigland>

Component:

openais

Assignee:

Jan Friesse <jfriesse>

Status:

CLOSED ERRATA

QA Contact:

Cluster QE <mspqa-list>

Severity:

urgent

Docs Contact:

Priority:

urgent

Version:

5.5

CC:

cluster-maint, gbarros, iannis, jkortus, jwest, sdake, tao

Target Milestone:

Keywords:

ZStream

Target Release:

---

Hardware:

All

OS:

Linux

Whiteboard:

Fixed In Version:

openais-0.80.6-29.el5

Doc Type:

Bug Fix

Doc Text:

Story Points:

---

Clone Of:

Clones:

626910

Environment:

Last Closed:

2011-07-21 07:46:43 UTC

Bug Depends On:

613551

Bug Blocks:

626910, 694180

Attachments:

Description	Flags
MMapped shared memory	none
Support for POSIX semaphores	none
IPC + posix semaphores hardening	none
Support for POSIX semaphores - try2 - add LDFLAGS to libcpg, libconfdb	none
Proposed patch - previous 3 patches merged together	none
Proposed patch - previous 3 patches merged together and removed /var/run from shm path	none
Proposed patch	none
Test case	none

Description David Teigland 2010-02-03 21:31:39 UTC

Description of problem:

When the system semaphores are exhausted, aisexec enters a tight spin which makes debugging/diagnosing the problem difficult.  It would be nice if aisexec could limit the retries it does to fail a little more gracefully, and maybe log an error message about running out of semaphores so people know what to do without searching or contacting support.

Comment 2 Jan Friesse 2010-04-19 13:38:51 UTC

Dave,
I'm not sure which problem you mean. Is the problem that the *_init functions (cpg_init, ...) in the client wait forever for a semaphore, or is there really a problem in the aisexec server?

If the problem is in the client, I'm not sure which solution to choose. Of course the function can return something like ERR_RETRY, ERR_LIBRARY, ... but isn't that an ABI/API breaker? Logging in the client seems even harder to me because we don't have any API for logging in client apps.

Honza

Comment 3 David Teigland 2010-04-19 15:00:13 UTC

Any library call that involves creating a semaphore can fail when the semaphores run out; I don't know what calls those are.  When that happens, the library call should return a fatal error (not retry since retrying is pointless), ideally some error which is unique or can be generally identified with the semaphores.  Most if not all of my apps that use cpg will log the error that's returned so we can easily tell.  Right now no error is returned when semaphores run out, and the caller ends up stuck, looping or spinning forever.

In addition to library calls, if there is any non-library code in openais that creates semaphores, it should obviously check for errors and log an error.

Comment 4 Jan Friesse 2010-04-19 15:26:30 UTC

Dave,
semaphores should always be created first in the lib and after that in the server. This is why I asked whether you know about something specific (a reproducer) in the server.

Even though retrying might seem pointless, that is not entirely true. You can always run the ipcrm command to remove dead semaphores, or kill an app holding semaphores, and after that the stuck app will continue to start without any problem.

Of course adding a special error code is not a big issue, but from my point of view the changed behavior is an ABI/API breaker.

Steve, any ideas?

Comment 5 Steven Dake 2010-04-19 15:37:46 UTC

Honza,

There are two separate ways to fix this problem.

Method #1 involves changing the semaphore usage to use unnamed posix semaphores.  This prevents our semaphores from running out and avoids interactions with third party ISV applications.

Method #2 involves returning an error in util.c:openais_service_connect:336.  Instead of looping on error, return an error such as ERR_LIBRARY.

To duplicate, create something like 150 or so ipc connections.

Note this problem doesn't exist in corosync because it uses posix unnamed semaphores.

Comment 6 Jan Friesse 2010-05-31 13:23:21 UTC

*** Bug 584574 has been marked as a duplicate of this bug. ***

Comment 7 Jan Friesse 2010-05-31 13:26:51 UTC

Created attachment 418296 [details]
MMapped shared memory

Support for mmapped shared memory. Although no problem has been reported with SYS V shm, it is just a matter of time until somebody finds that, with the SYS V semaphores problem solved, there is another one with shm.
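
The attachment itself is not reproduced here, so the following is only a minimal sketch of the general technique: back the shared region with a temporary file and mmap() it, instead of using shmget()/shmat(). The function name, the file name template and the choice to hand the path back to the caller are assumptions made for illustration.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Create a temporary file under `dir`, size it to `len` bytes and map it
 * MAP_SHARED. The generated file name is stored in `path` so that a peer
 * process could open and map the same file; the caller unlinks it later.
 * Returns the mapped address, or NULL on failure.
 */
static void *shm_file_create(const char *dir, size_t len,
                             char *path, size_t path_len)
{
    int n = snprintf(path, path_len, "%s/ipc-sketch-XXXXXX", dir);
    if (n < 0 || (size_t)n >= path_len)
        return NULL;

    int fd = mkstemp(path);
    if (fd == -1)
        return NULL;

    void *addr = MAP_FAILED;
    if (ftruncate(fd, (off_t)len) == 0)
        addr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    close(fd);  /* the mapping stays valid after the descriptor is closed */
    if (addr == MAP_FAILED) {
        unlink(path);
        return NULL;
    }
    return addr;
}
```

A caller would pass a directory such as "/dev/shm", use the region, then munmap() it and unlink() the path. No SYS V shm segment is consumed.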

Comment 8 Jan Friesse 2010-05-31 13:28:23 UTC

Created attachment 418297 [details]
Support for POSIX semaphores

POSIX semaphores don't use SYS V semaphore resources, so this should fix the bug while keeping functionality. Apply on top of the MMAP patch.

Comment 9 Jan Friesse 2010-05-31 13:29:45 UTC

Created attachment 418298 [details]
IPC + posix semaphores hardening

Backport of almost the same patch as in corosync to solve a segfault
issue in openais on exit.

The semop implementation doesn't have this problem because semop is a little
more clever and doesn't segfault on an uninitialized semaphore.

Apply on top of the Support for POSIX semaphores patch.

Comment 11 Steven Dake 2010-06-07 00:33:23 UTC

nice work Honza

Regards
-steve

Comment 15 Jan Friesse 2010-07-01 13:24:24 UTC

Created attachment 428413 [details]
Support for POSIX semaphores - try2 - add LDFLAGS to libcpg, libconfdb

POSIX semaphores don't use SYS V semaphore resources, so this should fix the
bug while keeping functionality. Apply on top of the MMAP patch.

It also adds LDFLAGS -lpthread to the libcpg and libconfdb linking, so the new version works correctly with old applications.

Comment 17 Jan Friesse 2010-07-08 14:02:27 UTC

Created attachment 430373 [details]
Proposed patch - previous 3 patches merged together

The patch contains the previous 3 patches merged together and also a "backport" for preallocation of the file.

Comment 18 Jan Friesse 2010-07-14 11:51:29 UTC

Created attachment 431741 [details]
Proposed patch - previous 3 patches merged together and removed /var/run from shm path

The patch contains the previous 3 patches merged together and also a "backport" for
preallocation of the file.

The only change since the previous patch is the removal of "/var/run" from the valid shm paths (so only /dev/shm is supported now).
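
The comment does not say how the file is preallocated. One plausible reading, stated here as an assumption, is that the backing file is filled with zeros before it is mapped, so that a full /dev/shm shows up as a write() error at creation time rather than as a fault when the mapping is first touched. A minimal sketch of such a helper, with a hypothetical name:

```c
#include <errno.h>
#include <string.h>
#include <unistd.h>

/*
 * Write `len` zero bytes to `fd` so the backing storage really exists.
 * Returns 0 on success, or -1 with errno set (for example ENOSPC).
 */
static int preallocate_zeroed(int fd, size_t len)
{
    char zeros[4096];
    memset(zeros, 0, sizeof(zeros));

    while (len > 0) {
        size_t chunk = len < sizeof(zeros) ? len : sizeof(zeros);
        ssize_t written = write(fd, zeros, chunk);
        if (written == -1) {
            if (errno == EINTR)
                continue;       /* interrupted before writing, try again */
            return -1;
        }
        if (written == 0) {     /* no progress, give up instead of spinning */
            errno = EIO;
            return -1;
        }
        len -= (size_t)written;
    }
    return 0;
}
```

The helper would be called on the freshly created file descriptor before mmap(), and any failure would be reported to the caller in the same way as a failed mmap().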

Comment 20 Steven Dake 2010-08-24 17:10:38 UTC

Please note this bugzilla will provide method #2 to offer some limited relief for the customer issue of complete deadlock.  Instead of deadlocking, applications that properly check errors will print them to the customer, allowing them to deduce that the system limits have been reached.

Comment 21 Nate Straz 2010-09-23 19:55:04 UTC

Looks like we should be able to verify this one.

Comment 27 RHEL Program Management 2011-01-11 20:40:18 UTC

This request was evaluated by Red Hat Product Management for
inclusion in the current release of Red Hat Enterprise Linux.
Because the affected component is not scheduled to be updated in the
current release, Red Hat is unfortunately unable to address this
request at this time. Red Hat invites you to ask your support
representative to propose this request, if appropriate and relevant,
in the next release of Red Hat Enterprise Linux.

Comment 28 RHEL Program Management 2011-01-11 23:15:29 UTC

This request was erroneously denied for the current release of
Red Hat Enterprise Linux.  The error has been fixed and this
request has been re-proposed for the current release.

Comment 29 Steven Dake 2011-01-25 23:16:49 UTC

*** Bug 626910 has been marked as a duplicate of this bug. ***

Comment 30 Jan Friesse 2011-03-17 16:56:30 UTC

Created attachment 486065 [details]
Proposed patch

open_ais_service_connect creates SYS V semaphores and shms.
If the system limit for semaphores/shms was exceeded, the code looped
in an endless cycle.

Now ENOSPC is correctly handled and SA_AIS_ERR_NO_SPACE is
returned to the lib user.
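
The patch is attached rather than quoted, so the following is only a sketch of the idea it describes: retry semget() when the randomly chosen key collides, but treat ENOSPC as fatal and report it. The enum values are hypothetical stand-ins for the SA_AIS_ERR_* codes, and the choice to retry on EEXIST and EACCES is an assumption.

```c
#include <errno.h>
#include <stdlib.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/types.h>

/* Hypothetical result codes standing in for the SA_AIS_ERR_* values. */
enum sketch_err {
    SKETCH_OK,
    SKETCH_ERR_TRY_AGAIN,
    SKETCH_ERR_NO_SPACE,
    SKETCH_ERR_LIBRARY
};

/* Decide what a failed semget() means for the caller. */
static enum sketch_err classify_semget_errno(int err)
{
    switch (err) {
    case EEXIST:    /* the random key is already in use */
    case EACCES:    /* the key belongs to another user's set */
        return SKETCH_ERR_TRY_AGAIN;
    case ENOSPC:    /* system-wide semaphore limit reached */
        return SKETCH_ERR_NO_SPACE;
    default:
        return SKETCH_ERR_LIBRARY;
    }
}

/* Create a set of 3 semaphores under a random key; loop only on key collisions. */
static enum sketch_err create_sem_set(int *semid_out)
{
    for (;;) {
        int semid = semget((key_t)random(), 3, IPC_CREAT | IPC_EXCL | 0600);
        if (semid != -1) {
            *semid_out = semid;
            return SKETCH_OK;
        }
        enum sketch_err rc = classify_semget_errno(errno);
        if (rc != SKETCH_ERR_TRY_AGAIN)
            return rc;      /* fatal: report it instead of spinning */
    }
}
```

With a classification like this, an exhausted semaphore table surfaces to the library user as a distinct error instead of an endless semget() loop.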

Comment 31 Jan Friesse 2011-03-17 17:00:41 UTC

Created attachment 486066 [details]
Test case

Test case for patch:

Output without patch:

[root@node-08 a]# ./without
Cpg initialize 0
Cpg initialize 1
...
Cpg initialize 126
Cpg initialize 127


strace $PID:
...
semget(0x22969f66, 3, IPC_CREAT|IPC_EXCL|0600) = -1 ENOSPC (No space left on device)
geteuid32()                             = 0
semget(0x16018da9, 3, IPC_CREAT|IPC_EXCL|0600) = -1 ENOSPC (No space left on device)
geteuid32()                             = 0
semget(0x7732a8b1, 3, IPC_CREAT|IPC_EXCL|0600) = -1 ENOSPC (No space left on device)
geteuid32()                             = 0
...


With patch:

[root@node-08 a]# ./with
Cpg initialize 0
Cpg initialize 1
...
Cpg initialize 126
Cpg initialize 127
Could not initialize Cluster Process Group API instance error 15
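
The attached test program is not shown, and it depends on the openais cpg library. The underlying limit can be observed without openais by creating SYS V semaphore sets of 3 semaphores, the same size seen in the strace output, until semget() fails. The sketch below does that with a hypothetical helper and removes every set it created before returning.

```c
#include <errno.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/types.h>

#define SKETCH_MAX_SETS 256

/*
 * Create up to `max_sets` private sets of 3 semaphores, stopping at the
 * first failure. Every set created here is removed again before returning.
 * Returns the number of sets created; *err_out receives the errno of the
 * failing semget(), or 0 if all requested sets were created.
 */
static int count_creatable_sem_sets(int max_sets, int *err_out)
{
    int ids[SKETCH_MAX_SETS];
    int created = 0;

    *err_out = 0;
    if (max_sets > SKETCH_MAX_SETS)
        max_sets = SKETCH_MAX_SETS;

    while (created < max_sets) {
        int id = semget(IPC_PRIVATE, 3, IPC_CREAT | 0600);
        if (id == -1) {
            *err_out = errno;   /* ENOSPC once the system limit is hit */
            break;
        }
        ids[created++] = id;
    }

    for (int i = 0; i < created; i++)
        semctl(ids[i], 0, IPC_RMID);    /* same effect as ipcrm for this set */

    return created;
}
```

The stop after 128 successful initializations in the output above would be consistent with a limit of 128 semaphore sets (SEMMNI), and error 15 matches the numeric value of SA_AIS_ERR_NO_SPACE in the SA Forum error enumeration. Both readings are inferences from the output, not statements made in the report.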

Comment 32 Jan Friesse 2011-03-24 15:47:11 UTC

Patch sent to ML

Comment 41 errata-xmlrpc 2011-07-21 07:46:43 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-1012.html